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Abstract 

Markov jump processes (or continuous-time Markov chains) are a simple and important 
class of continuous-time dynamical systems. In this paper, we tackle the problem of simu- 
lating from the posterior distribution over the unobserved paths in these models given some 
observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea 
of uniformization. This sets up a Markov chain over paths by alternately sampling a finite 
set of virtual jump times given the current path and then sampling a new path given the set 
of extant and virtual jump times using a standard hidden Markov model forward filtering- 
backward sampling algorithm. Our method is exact and does not involve approximations like 
time-discretization. We demonstrate how our sampler extends naturally to MJP-based models 
like Markov-modulated Poisson processes and continuous-time Bayesian networks and show 
significant computational benefits over state-of-the-art MCMC samplers for these models. 

Keywords: Markov jump process, uniformization, MCMC, Gibbs sampler, Markov-modulated 
Poisson process, continuous-time Bayesian network 



1 Introduction 



The Markov jump process (MJP) extends the discrete-time Markov chain to continuous time, and 
forms a simple and popular class of continuous-time dynamical systems. In Bayesian modelling 
applications, the MJP is widely used as a prior distribution over the piecewise-constant evolution 
of the state of a system. The Markov property of the MJP makes it both a realistic model for 
various physical and chemical systems, as well as a convenient approximation for more complex 
phenomena in biology, finance, queuing systems etc. In chemistry and biology, stochastic kinetic 



models use the state of an MJP to represent the sizes of various interacting species (e.g. Gillespie 



( |1977[ ); |Golightly and Wilkinson| ( |2011[ )). In queuing applications, the state may represent the 
number of pending jobs in a queue ( |Breuer[ |2003[ |Tijms[ [1986), with the arrival and processing 
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of jobs treated as memoryless events. MJPs find wide application in genetics, for example, an 
MJP trajectory is sometimes used to represent a segmentation of a strand of genetic matter. Here 
'time' represents position along the strand, with particular 'motifs' occurring with different rates 
in different regions (Fearnhead and Sherlock 2006| ). MJPs are also widely used in finance, for 
example, Elliott and Osakwe (2006) use an MJP to model switches in the parameters that govern 
the dynamics of stock prices (the latter being modelled with a Levy process). 

In the Bayesian setting, the challenge is to characterize the posterior distribution over MJP 
trajectories given noisy observations; this typically cannot be performed analytically. Various 
sampling-based ( |Fearnhead and Sherlock] |20T)6] |Boys et al.[ |200"8| |E1-Hay et at] [2008| |Fan and 



Sheltonj [2008| |Hobolth and Stone} [2009 ;) and deterministic ( |Nodelman eTaLj [20021 12005| |Opper 



and Sanguinetti[[2007t|Cohn et al.[|2010| ) approximations have been proposed in the literature, but 



come with problems: they are often generic methods that do not exploit the structure of the MJP, 
and when they do, involve expensive computations like matrix exponentiation, matrix diagonal- 
ization or root-finding, or are biased, involving some form of time-discretization or independence 
assumptions. Moreover, these methods do not extend easily to more complicated likelihood func- 
tions which require specialized algorithms (for instance, the contribution of Fearnhead and Sher- 



lock (2006) is to develop an exact sampler for Markov-modulated Poisson processes (MMPPs), 



where an MJP modulates the rate of a Poisson process). 

In this work, an extension of |Rao and Teh| ( |20 1 1 a| ) , we describe a novel Markov chain Monte 
Carlo (MCMC) sampling algorithm for MJPs that avoids the need for the expensive computations 
described previously, and does not involve any form of approximation (i.e. our MCMC sampler 
converges to the true posterior). Importantly, our sampler is easily adapted to complicated ex- 
tensions of MJPs such as MMPPs and continuous-time Bayesian networks (CTBNs) (Nodelman 



et al. 2002 ), and is significantly more efficient than the specialized samplers developed for these 
models. Like many existing methods, our sampler introduces auxiliary variables which simplify 
the structure of the MJP, using an idea called uniformization. Importantly, unlike some existing 
methods which produce independent posterior samples of these auxiliary variables, our method 
samples these conditioned on the current sample trajectory. While the former approach can be 
hard for complicated likelihood functions, ours results in a simple distribution over the auxiliary 
variables that is independent of the observations. The latter are accounted for during a straight- 
forward discrete-time forward- filtering backward- sampling step to resample a new trajectory. The 
overall structure of our algorithm is that of an auxiliary variable Gibbs sampler, alternately re- 
sampling the auxiliary variables given the MJP trajectory, and the trajectory given the auxiliary 
variables. 

In section [2] we briefly review Markov jump processes. In section [3] we introduce the idea of 
uniformization and describe our MCMC sampler for the simple case of a discretely observed MJP. 
In section [4} we apply our sampler to the Markov-modulated Poisson process, while in section [5} 
we describe continuous-time Bayesian networks, and extend our algorithm to that setting. In both 
sections, we report experiments comparing our algorithm to state-of-the-art sampling algorithms 
developed for these models. We end with a discussion in section[6j 
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2 Markov Jump Processes (MJPs) 



S(t) = (s ,S,T) 
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Figure 1: (left) An MJP path (s , S 1 , T), (right) a uniformized representation (t> , V, W). 

A Markov jump process (S(i), t G M+) is a stochastic process with right-continuous, piecewise- 
constant paths (see for example <^i nlar| ( T975| )). The paths themselves take values in some count- 
able space (S, £$), where £s is the discrete cr-algebra. As in typical applications, we assume S 
is finite (say S = {1,2, ...N}). We also assume the process is homogeneous, implying (together 
with the Markov property) that for all times t, t' G M + and states s, s' £ 5, 



p (S(f + 1) = s|S(f ) = s', (S(u), u < 0) = t^L 



(1) 



for some stochastic matrix P 4 that depends only on t. The family of transition matrices (P t , t > 0) 
is defined by a matrix A 6 m NxN called the rate matrix or generator of the MJP. A is the time- 
derivative of P t at £ = 0, with 



P t = exp(At) 



p(S(f + dt) = s|S(0 = s' 



(for s ^ s') 



(2) 
(3) 



where is the matrix exponential. The off-diagonal elements of A are non-negative, and represent 
the rates of transiting from one state to another. Its diagonal entries are A s = A ss = — Yls'^s ^ s ' s 
for each s, so that its columns sum to 0, with — A s characterising the total rate of leaving state s. 

Consider a time interval T = [t start, tend], with the Borel a-algebra £7-. Let ir be a density 
with respect to the counting measure fi s on (S, £5); this defines the initial distribution over states 
at t start- Then an MJP is described by the following generative process over paths on this interval 
(called Gillespie's algorithm (Gilles pie] |1977[ )): 
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Algorithm 1 Gillespie's algorithm to sample an MJP path on the interval [t start , t en( i 



Input: The rate matrix A and the initial distribution over states 7r . 

Output: An MJP trajectory S(t) = (s , S, T). 

1: Assign the MJP a state s ~ n . Set to = t start and z = 0. 

2: loop 

3: Draw 2 ~ exp(|A.J). 

4: If tj + z > t end then return (s , • • • , s*, ti, • • • , U) and stop 
5: Increment z and let U = + 2. 

6: The MJP jumps to a new state s« = s at time £j, for an s 7^ Sj_i, 
7: with probability proportional to A ss . _ 1 . 

8: end loop 



If all event rates are finite, an MJP trajectory will almost surely have only a finite number of 
jumps. Let there be n jumps, and let these occur at the ordered times (ti, - • • ,t n ). Define T = 
(ii, • ■ • ,t n ), and let S = (si, - • • , s n ) be the corresponding sequence of states, where Sj = S(ij). 
Along with the initial state so, (so 3 S, T) completely characterizes the MJP trajectory over T (figure 
□(left)). 

From Gillespie's algorithm, we see that sampling an MJP trajectory involves sequentially sam- 
pling n + 1 waiting times from exponential densities with one of iV rates, and n new states from 
one of N discrete distributions, each depending on the previous state. The ith waiting time equals 
(U — ti-i) and is drawn from an exponential with rate |A Si _J, while the probability the ith state 
equals Sj is A SiS . _ 1 /\A Si _ 1 1. The last waiting time can take any value greater than t en( i — t n . Thus, 
under an MJP, a random element (s , S, T) has density 

p(s ,S,T) = n (s ) (fll^le-^i-^-^^^A . e -\AsJ(t md -t n ) (4) 

n^-i-i exp - / \ A m\ dt ) (5) 



,i=l 



tstar 



The density above is defined w.r.t. a base measure that we define next (see (Daley and Vere-Jones 



2008 ) for more details). Let /1-7- be Lebesgue measure on T. Recalling that the state space of the 



MJP is S, we can view (S, T) as a sequence of elements in the product space M. = S x T. Let 
E» and hm = lis x A*r be me corresponding product cr-algebra and product measure. Define 
.M n as the n-fold product space with the usual product cr-algebra and product measure [i n M . 
Now let Ai u = Ui^o be a un i° n space, elements of which represent finite length pure -jump 
paths^| Let be the corresponding union cr-algebra, where each measurable set B £ can be 
expressed as B = W^LqB 1 with B l = B fl A'!' G S^. Assign this space the measure iP M defined 
as: oo 

li u M (B) = J2^M(B t ) (6) 



i=0 



'Define M° as a point satisfying At x M = M x .M = (Daley and Vere-Jones 2008 1. 
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Then, any element (s ,S,T) £ S x .M u sampled from Gillespie's algorithm has density w.r.t. 
jis x /i^ given by equation (|5]). 



3 MCMC inference via uniformization 

In this paper, we are concerned with the problem of sampling MJP paths over the interval T = 
{t start, tend] given noisy observations of the state of the MJP. In the simplest case, we observe 
the state of the process at the boundaries t start and t end . More generally, we are given the ini- 
tial distribution over states n as well as a set of O noisy observations X = {Xq, ...X t ° } at 
times T° = {t°, . . . , t° } with likelihoods p(X t o\S{t°)) and we wish to sample from the posterior 
p(so, S, T\X). Here we have implicitly assumed that the observation times T° are fixed. Some- 
times the observation times themselves can depend on the state of the MJP, resulting effectively 
in continuous-time observations. This is the case for the Markov-modulated Poisson process and 
CTBNs. As we will show later, our method handles these cases quite naturally as well. 

A simple approach to inference is to discretize time and work with the resulting approxima- 
tion. The time-discretized MJP corresponds to the familiar discrete-time Markov chain, and its 
Markov structure can be exploited to construct dynamic programming algorithms like the forward- 
filtering backward-sampling algorithm to sample posterior trajectories efficiently. However, time- 
discretization introduces a bias into our inferences, as the system can change state only at a fixed 
set of times, and as the maximum number of state changes is limited to a finite number. To control 
this bias, one needs to discretize time at a fine granularity, resulting in long Markov chains, and 
expensive computations. 

Recently, there has been growing interest in constructing exact MCMC samplers for MJPs 



without any approximations such as time-discretization. We review these in section 3.3 One 
class of methods exploits the fact an MJP can be exactly represented by a discrete-time Markov 
chain on a random time-discretization. Unlike discretization on a regular grid, a random grid can 
be quite coarse without introducing any bias whatsoever. Given this discretization, we can use 
the forward-backward algorithm to perform efficient sampling. However, we do not observe the 
random discretization, and thus also need to sample this from its posterior distribution. Such an 
approach depends on the likelihood process, and a number of algorithms attempt to solve this prob- 
lem for specific observation processes. Our approach is to resample the discretization conditioned 
on the system trajectory. As we will see this is independent of the likelihood process, resulting in 
a simple, flexible and efficient MCMC sampler. 

3.1 Uniformization 

We first introduce the idea of uniformization (Jensen] 1953; Cinlar[ 1975 Hobolth and Stone 



2009), which forms the basis of our sampling algorithm. For an MJP with rate-matrix A, choose 
some Vt > max s \A S \. Let W = (wi, . . . ,W\w\) be an ordered set of times on the interval 
[tstart, t en d] drawn from a homogeneous Poisson process with intensity Q. W constitutes a ran- 
dom discretization of the time-interval [t start , t end ]. 
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Next, letting / be the identity matrix, observe that B = (I + ^A) is a stochastic matrix (it has 
nonnegative elements, and its columns sum to one). Run a discrete-time Markov chain with initial 
distribution % and transition matrix B on the times in W; this is a Markov chain subordinated to 
the Poisson process. The Markov chain will assign a set of states (t> , V), vq at the initial time t start , 
and V = (f i, . . . , v\v\) at the discretization times W. In particular, v is drawn from 7r , while Vi is 
drawn with probabilities given by the v^th column of B. Just as (s , 5*, T) characterizes an MJP 
path, (t> , V, W) also characterizes a sample path of some piecewise-constant, right-continuous 
stochastic process on [t sta rt, tend}- Observe that the matrix B allows self-transitions, so that unlike 
S, V can jump from a state back to the same state. We treat these as virtual jumps, and regard 
(v o, V, W) as a redundant representation of a pure -jump process that always jumps to a new state 
(see figure [TJ (right)). The virtual jumps provide a mechanism to 'thin' the set W, thereby rejecting 
some of its events. This corrects for the fact that since the Poisson rate f2 dominates the leaving 
rates of all states of the MJP, W will on average contain more events than there are jumps in the 
MJP path. As the parameter increases, the number of events in W increases; at the same time 
the diagonal entries of B start to dominate, so that the number of self-transitions (thinned events) 
also increases. The next proposition shows that these two effects exactly compensate each other, 
so that the process characterized by (v , V, W) is precisely the desired MJP: 



Proposition 1 (Jensen ( 1953)). For any Vt > max s \ A S \, (s , S, T) and (v 0} V, W) define the same 



Markov jump process S(£). 

Proof. We follow [Hobolth and Stone| (|2009). From equation (Q, the marginal distribution of the 



MJP at time t is given by 

7T 4 = exp(At)7i 



= exp (Q(B - I)t) 7T 

= exp (—Qt) exp (QtB) ttq 

= E(( ex P(- fit )^)( 5n7r o)) (7) 

The first term in the summation is the probability that a rate Poisson produces n events in 
an interval of length t, i.e. that \W\ = n. The second term gives the marginal distribution over 
states for a discrete-time Markov chain after n steps, given that the initial state is drawn from 
7r , and subsequent states are assigned according to a transition matrix B. Summing over n, we 
obtain the marginal distribution over states at time t. Since the transition kernels induced by the 
uniformization procedure agree with those of the Markov jump process (exp (At)) for all t, and 
since the two processes also share the same initial distribution of states, 7r , all finite dimensional 



distributions agree. Following Kolmogorov's extension theorem (Kallenberg, 2002), both thus 
define versions of the same stochastic process. 

□ 

A more direct but cumbersome approach is note that (t> , V, W) is also an element of the space 
S x A4 U . We can then write down its density p(vo, V, W) w.r.t. fis X At u , and show that marginal- 
izing out the number and locations of self-transitions recovers equation ([5]). While we do not take 



6 



this approach here, we will derive the density p(v , V, W) for use later. As in section [2} let T u and 
S u denote the measure spaces consisting of finite sequences of times and states respectively, and 
let fj,j- and fig be the corresponding base measures. The Poisson realization W is determined by 
waiting times sampled from a rate f2 exponential distribution, so that following equation (|5]), W 
has density w.r.t. fij- given by 



p(W) = Q\ w \ e ~ n<yteni ~ tstart ' > 



(8) 



Similarly, from the construction of the Markov chain, it follows that the state assignment 
(vq, V) has probability density w.r.t. fi s x ji^ given by 



p(v ,V\W) = 7r (v )l[ 1 + 



1=1 



A 



VjVi-1 



Since under uniformization \ V\ = \W\,it follows that 

/j%(dV) x i4(dW) = /j}p(dV) x ff\dW) 
= (li T xvs) ]V] (d(V,W)) 
= ^ M (d(V,W)). 

Thus, from equations ([8]) and (|9]), (v , V, W) has density w.r.t. fi s x jj^ given by 



\v 



(•^DitJi-l) 



1=1 



(9) 



(10) 



(11) 



3.2 The MCMC algorithm 

We adapt the uniformization scheme described above to construct an auxiliary variable Gibbs 
sampler. Recall that the only difference between (s , S, T) and (vq, V, W) is the presence of an 
auxiliary set of virtual jumps in the latter. Call the virtual jump times Uj-; associated with Uj- 
is a sequence of states Us uniquely determined by (s ,S,T) and Uj- (see figure [TJright)). Let 
U = (Us, Uj-), and observe that (s , S, T, U) and (v , V, W) represent the same point in /x 5 x \j^ M . 



(S,T) 



(V,W) 



(V,W) 



Figure 2: Uniformization-based Gibbs sampler: starting with an MJP trajectory (left), resample the 
thinned events (middle) and then resample the trajectory given all Poisson events (right). Discard 
the thinned events and repeat. 
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Each iteration of the MCMC algorithm then proceeds as follows. Given an MJP trajectory 
(sq, S, T) (figure [5] (left)), we proceed by first sampling the set of virtual jumps Uj- given (s , S, T), 
as a result recovering the uniformized characterization (s , V, W) (figure [2] (middle)). This corre- 
sponds to a random discretization of [t sta rt, tend] at times W. We now discard the state sequence 
V, and perform a simple HMM forward-filtering backward-sampling step to resample a new state 
sequence V. Finally, dropping the virtual jumps in (s , V, W) gives a new MJP path (s , S, T). 

The next proposition shows that conditioned on (s , S,T), the virtual jump times Uj- are dis- 
tributed as an inhomogeneous Poisson process with intensity R{t) = Vt + As(t)- This intensity is 
piecewise-constant, taking the value = + A Si on the interval [U,t i+ i) (with t = t sta rt and 
t n+1 = t end ), so it is easy to sample Uj and thus U. 

Proposition 2. For any Q > max s (—A s ), both (s , S, T, U) and (vq, V, W) have the same density 
w.r.t. us x /i^. In other words, the Markov jump process (sq,S,T) along with virtual jumps 
U drawn from the inhomogeneous Poisson process as above is equivalent to the times W being 
drawn from a Poisson process with rate Q, followed by the states (v ,V) being drawn from the 
subordinated Markov chain. 

Proof. Let n = \T\ be the number of jumps in the current MJP trajectory. Define \Ui\ as the 
number of auxiliary times in the interval (£j, t i+1 ). Then, \Uf\ = ^™ =0 \ Ui\. If Uj- is sampled from 
a piecewise-inhomogeneous Poisson process, then from equation ([8]) it has density 

p(U T \s ,S,T)= (f[(tt + Aj u Aex P (- \n + A s{t) )dt\ (12) 

w.r.t. fij- Since Us is deterministically given by U r and (s , S, T), U = (U s , U T ) given (s , S, T) 
has the same density as equation ( fT2] ), but now w.r.t. pt° M . Multiplying equations (|5]) and ( fT2] ), we 



C)\U\+n n / A \ \Ut\ n . 

4—n \ / 4— 1 



see that (s , S, T, U) has density 

n\U\+n n / 4 \ 1^1 n 4 

- n(i+#) n - 

i=0 v 7 i=l 

n n 

= e-"^-*-*^ )7r ( S0 ) n + n (i3) 

i=0 1=1 

w.r.t. us x /-t^ x A*1m • However, by definition, 

^ M (d(S,T)) x ^ M (dU) = ^(d(S,T)) x $(dU) 

= $} +m {d{S,T,U)) 

= /i u M (d(S,T,U)) (14) 



Comparing with equation (11 ), and recognizing that \Ui\ is the number of self-transitions in the 



interval U), we see both are equal, giving the desired result. □ 
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We can now incorporate the likelihoods coming from the observations X. Firstly, note that 
by assumption X depends only on the MJP trajectory (s , S, T) and not on the auxiliary jumps 
U, thus the conditional distribution of Uj- given (s , S, T, X) is still the inhomogeneous Poisson 
process given above. Now let X\ W . ;W \ represent the observations in the interval [w{, u>i+i) (taking 
w\w\+i — tend)- Throughout this interval, the MJP is in state Vi, giving a likelihood term: 

Li(vi) = p(X [WiiWi+1 )\S(t) =Vifort e [Wi,w i+1 )) (15) 
For the case of noisy observations of the MJP state at a discrete set of times T°, this simplifies to 

U{Vi)= J] P (X t a\S(t°)= Vi ) (16) 

Conditioned on the times W, (s , V) is a Markov chain with initial distribution 7r , transition 



matrix B and likelihoods given by equation (fT3]). We can efficiently resample (s , V) using the 



standard forward filtering-backward sampling algorithm. This cost of this is 0(N 2 \ V\), quadratic 
in the number of states and linear in the length of the chain. Further, any structure in A (e.g. 
sparsity) is inherited by B and can be exploited easily. Let (§q, V) be the new state sequence. 
Then (s , V, W) will correspond to a new MJP path S(t), characterized by (s , S, T) by discarding 
virtual jumps from (V", W). This completes the description of our sampler, which we summarize 
in algorithm [2| 

Algorithm 2 Blocked Gibbs sampler for an MJP on the interval [t start , t end ] 

Input: A set of observations X, and parameters A (the rate matrix), 

7r (the initial distribution over states) and f2 > max s (— A s ). 

The previous MJP path, S(t) = (s , S, T). 
Output:A new MJP trajectory S(t) = (s , S, f). 

1: Sample Uj- C [t sta rt, tend] from a Poisson process with piecewise-constant rate 

R{t) = (Q + As®) 

Define W = T U Uj- (suitably reordered). 
2: Sample a path (s , V) from a discrete-time Markov chain with 1+ 1 W\ steps using the forward- 
backward algorithm. The transition matrix of the Markov chain is 

while the initial distribution over states is n . The likelihood of state s at step i is 

Li(s) = p(X[ WhWi+1 )\S(t) = s for t G [w h w i+ i)) 

3: Let T be the set of times in W when the Markov chain changes state. Define S as the corre- 
sponding set of state values. Return (s , S, T). 
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Proposition 3. The auxiliary variable Gibbs sampler described above has the posterior distribu- 
tion p(sq, S, T\X) as its stationary distribution. Moreover, if 
Q > max s \A a \, the resulting Markov chain is ergodic. 

Proof. The first statement follows since the algorithm simply introduces auxiliary variables U, and 
then conditionally samples V given X and W. To show ergodicity, note that if > max s (- A s ), 
then the intensity of the subordinating Poisson process is strictly positive. Thus, there is positive 
probability density of sampling appropriate auxiliary jump times U and moving from any MJP 
trajectory to any other. □ 

Note that it is essential for Vt to be strictly greater than max s \A S \; equality is not sufficient for 
ergodicity. For example, if all diagonal elements of A are equal to Vt, then the Poisson process for 
Uq- will have intensity 0, so that the set of jump times T will never increase. 

Since the new state sequence V is independent of V given W, the only dependence between 
successive MCMC samples arises because the new candidate jump times include the old jump 
times i.e. T C W. This means that the new MJP trajectory has non-zero probability of making a 
jump at a same time as the old trajectory. Increasing introduces more virtual jumps, and as T 
becomes a smaller subset of W, we get faster mixing. Of course, increasing makes the HMM 
chain grow longer, leading to a linear increase in the computational cost per iteration. Thus the 
parameter allows a trade-off between mixing rate and computational cost. We will study this 



trade-off in section 3.5 In all other experiments, we set Q = max s (— 2A S ) as we find this works 



quite well, with the samplers typically converging after less than 5 iterations. 

3.3 Previous posterior sampling schemes 

A simple approach when the MJP state is observed at the ends of an interval is rejection sampling: 
sample paths given the observed start state via Gillespie's algorithm, and reject those that do not 



end in the observed end state (Nielsen 2002). We can extend this to noisy observations by impor- 



tance sampling or particle filtering (Fan and Shelton, 2008). Recently, Golightly and Wilkinson 



(|2011[) have applied particle MCMC methods to correct the bias introduced by standard particle 



filtering methods. However, these methods are efficient only in situations where the data exerts a 
relatively weak influence on the unobserved trajectory (compared to the prior): a large state-space 
or an unlikely end state can result in a large number of rejections or small effective sample sizes. 

A second approach, more specific to MJPs, integrates out the infinitely many paths of the MJP 
in between observations using matrix exponentiation (equation (|2])), and uses forward-backward 
dynamic programming to sum over the states at the finitely many observation times (see (Hobolth 



and Stone] [2009) for a review). Unfortunately, matrix exponentiation is an expensive operation 



that scales as 0(N 3 ), cubically in the number of states. Moreover, the matrix resulting from matrix 
exponentiation is dense and any structure, e.g. sparsity, in the rate matrix A cannot be exploited. 
A third approach is, like ours, based on the idea of uniformization ( Hobolth and Stone] 2009| ). 



This proceeds by producing independent posterior samples of the Poisson events W in the interval 
between observations, and then (like our sampler) running a discrete-time Markov chain on this set 
of times to sample a new trajectory. However, sampling from the posterior distribution over W is 
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not easy, depending crucially on the observation process, and usually requires a random number of 
0(N 3 ) matrix multiplications (as the sampler iterates over the possible number of Poisson events). 
By contrast, instead of producing independent samples, ours is an MCMC algorithm. At the price 
of producing dependent samples, our method scales as 0(N 2 ), does not require matrix exponen- 
tiations, easily exploits structure in the rate matrix and naturally extends to various extensions of 
MJPs. Moreover, we demonstrate that our sampler mixes very rapidly. 



3.4 Bayesian inference on the MJP parameters 

In this section we briefly describe how full Bayesian analysis can be performed by placing priors on 



the MJP parameters A and n and sampling them as part of the MCMC algorithm. Like Fearnhead 



and Sherlock (2006), we place independent gamma priors on the (negative) diagonal elements 
of A and independent Dirichlet priors on the transition probabilities. In particular, for all s let 

p s i s = A s i s /\A S \ and define the prior: 

\A S \ ~ Gamma(ai, 02) (17) 
{Ps's, sV«)~ Dirichlet (/3) (18) 

This prior is conjugate, with sufficient statistics for the posterior distribution given a trajectory S(t) 
being the total amount of time T s spent in each state s and the number of transitions n s / s from each 
s to s'. Then, 

\A S \ I (s Q ,S,T) ~ Gamma(«i + J2 S '^ S n s's, «2 + T S ) (19) 
(Ps's, s) I (s , S, T) ~ Dirichlet (/3 + {n s , S) s' ^ s)) (20) 

It is important to note that we resample the rate matrix A conditioned on (sq,S,T), and not 
(vq, V, W). A new rate matrix A implies a new uniformization rate VL, and in the latter case, 
we must also account for the probability of the Poisson events W under Vt. Besides being more 
complicated, this coupling between W and Vt can slow down mixing of the MCMC sampler. Thus, 
we first discard the thinned events U, update A conditioned only on the MJP trajectory, and then 
resample the thinned events under the new parameters. We can view the sampler of algorithm [2] as 
a transition kernel K,a((so, S, T), (s , S, T)) that preserves the posterior distribution under the rate 
matrix A. Our overall sampler then alternately updates (s , S, T) via the transition kernel ICa(-, •), 
and then updates A given (s , S, T). 

Finally, we can either fix ir or set it equal to the stationary distribution of the MJP with rate 



matrix A. In the latter case, equations (19) and (20) serve as a Metropolis-Hastings proposal. We 



accept a proposed A sampled from this distribution with probability equal to the probability of 
the current initial state under the stationary distribution of A. Note that computing this stationary 
distribution requires solving an 0(N 3 ) eigenvector problem, so that in this case, the overall Gibbs 
sampler scales cubically even though algorithm [2] scales quadratically. 



3.5 Experiments 

We first look at the effect of the parameter on the mixing on the MCMC sampler. We generated 
a random 5-by-5 matrix A (with hyperparameters ai = oli = f3 = 1), and used this to generate an 
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MJP trajectory with a uniform initial distribution over states. The state of this MJP trajectory was 
observed via a Poisson process likelihood model (see section [4]), and posterior samples given the 
observations and A were produced by a C++ implementation of our algorithm. 1000 MCMC runs 
were performed, each run consisting of 10000 iterations after a burn-in of 1000 iterations. For each 
run, the number of transitions as well as the time spent in each state was calculated, and effective 
sample sizes (ESSs) of these statistics were calculated using R-CODA ( |Plummer et al. [ [20 06 ) . The 
'overall' ESS of a run is defined to be the median ESS across all these ESSs. 

Figure [3] (left) plots the overall ESS against computation time per run, for different scalings 
k, where = fcmax s \A S \. We see that increasing does increase the mixing rate, however the 
added computational cost quickly swamps out any benefit this might afford. Figure [3] (right) is a 
similar plot for the case where we also performed Bayesian inference for the MJP parameter A as 
described in section |3~4l Now the overall ESS of an MCMC run is defined to be the median ESS 
of all MJP parameters. Interestingly, in this scenario ESSs are fairly insensitive to f2, suggesting 
an 'MCMC within Gibbs' update as proposed here using dependent trajectories is as effective as 
one using independent trajectories. We found this to be true in general: when embedded within 
an outer MCMC sampler, our sampler produced similar effective ESSs as an MJP sampler that 
produces independent trajectories. The latter is typically more expensive, and in any case, we will 
show that the computational savings provided by our sampler far outweigh the cost of dependent 
trajectories. 
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Figure 3: Effective sample sizes vs computation times for different scalings of Q for (left) a fixed 
rate matrix A and (right) Bayesian inference on the rate matrix. Whiskers are quartiles over 1000 
runs. 



In light of figure [3j for all subsequent experiments we set Q = 2 max s \A S \. Figure [4] shows the 
initial burn-in of a sampler with this setting for different initializations. The vertical axis shows the 
number of state transitions in the MJP trajectory of each iteration. This quantity quickly reaches 
its equilibrium value within a few iterations. 
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MCMC iteration number 

Figure 4: Trace plot of the number of MJP transitions for different initializations. Black lines are 
the maximum and minimum number of MJP transitions for each iteration, over all initializations. 



4 Markov-modulated Poisson processes 

A Markov modulated Poisson process (MMPP) is a doubly- stochastic Poisson process whose in- 
tensity function is piecewise-constant and distributed according to a Markov jump process. Sup- 
pose the MJP (S(t),t e [t start, tend}) has iV states, and is parametrized by an initial distribution 
over states 7r and a rate matrix A. Associate with each state s a nonnegative constant X s called 
the emission rate of state s. Let O be a set of points drawn from a Poisson process with piecewise- 
constant rate R(t) = As(t). Note that O is unrelated to the subordinating Poisson process from 
the uniformization-based construction of the MJP, and we call it the output Poisson process. The 
Poisson observations O effectively form a continuous-time observation of the latent MJP, with 
the absence of Poisson events also informative about the MJP state. MMPPs have been used to 
model phenomenon like the distribution of rare DNA motifs along a gene (Fearnhead and Sherlock 
(2006)), photon arrival in single molecule fluorescence experiments ( |Burzykowski et alj |2003j ), 
web page requests ( |Scott and Smyth[|2003[ ) etc. 



Fearnhead and Sherlock ( 2006| ) developed an exact sampler for MMPPs based on a dynamic 



program for calculating the probability of O marginalizing out the MJP trajectory. The dynamic 
program keeps track of the probability of the MMPP emitting all Poisson events prior to a time t 
and ending in MJP state s. The dynamic program then proceeds by iterating over all Poisson events 
in O in increasing order, at each iteration calculating probabilities using matrix exponentiation. A 
backward sampling step then draws an exact posterior sample for the MJP trajectory (S(t), t E O) 
at the times in O. Finally a uniformization-based endpoint conditioned MJP sampler is used to fill 
in the MJP trajectory between every pair of times in O. 

The main advantage of this method is that it produces independent posterior samples. It does 
this at the price of being fairly complicated and computationally intensive. Moreover, it has the 
disadvantage of operating at the time scale of the Poisson observations rather than the dynamics of 
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the latent MJP: for high Poisson rates, the number of matrix exponentiations will be high, leading 
to an inefficient algorithm. 

Our MCMC sampler outlined in the previous section can be straightforwardly extended to the 
MMPP without any of these disadvantages. Resampling the auxiliary jump events (step 1 in algo- 
rithm^ remains unaffected, since conditioned on the current MJP trajectory, they are independent 
of the observations O. Step 2 requires calculating the emission likelihoods Lj(s), which is simply 
given by: 

Li(s) = (X s ) m exp (-A S K +1 - O) , (21) 
\Oi \ being the number of events in O in the interval [wi, w i+ i). Note that evaluating the likelihood 



plj ) requires only the number of observed Poisson events between every successive pair of times 
in W. Compared to our algorithm, the approach of Fearnhead and Sherlock (2006 1 is much more 
involved and inefficient. 



4.1 Experiments 

In the following, we compare a C++ implementation of our algorithm with an implementation^ 



of the algorithm of Fearnhead and Sherlock (2006), coded in C. We performed fully Bayesian 



inference, sampling both the MJP parameters (as described in section 3.4) and the Poisson rates X s 
(conjugate gamma priors were placed on these). In all instances, our algorithm did significantly 
better, the performance improvement increasing with the complexity of the problem. 




Figure 5: CPU time to produce a 100 effective samples as we observe (left) increasing number of 
Poisson events in an interval of length 10 (centre), 10 Poisson events over increasing time intervals 
and (right) increasing intervals with the number of events increasing on average. 



In the first set of experiments, the dimension of the latent MJP was fixed to 5. The prior on 
the rate matrix A had parameters aii = a 2 — /3 — 1 (see section 3.4). The shape parameter of the 
gamma prior on the emission rate of state s, X s , was set to s (thereby breaking symmetry across 
states); the scale parameter was fixed at 1. 10 draws of O were simulated using the MMPP. For 
each observed O, both MCMC algorithms were run for 1000 burn-in iterations followed by 10000 

2 Downloaded from Chris Sherlock's webpage. 
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iterations where samples were collected. For each run, the ESS for each parameter was estimated 
using R-CODA, and the overall ESS was defined to be the median ESS over all parameters. 

Figure [5] reports the average computation times required by each algorithm to produce 100 
effective samples, under different scenarios. The leftmost plot shows the computation times as 
a function of the numbers of Poisson events observed in an interval of fixed length 10. For our 
sampler, increasing the number of observed events leaves the computation time largely unaffected, 
while for the sampler of|Fearnhead and Sherlock|(|2006), this increases quite significantly. This 



reiterates the point that our sampler works at the time scale of the latent MJP, while Fearnhead and 



Sherlock (2006) work at the time scale of the observed Poisson process. In the middle plot, we 



fix the number of observed Poisson events to 10, while increasing the length of the observation 
interval instead, while in the rightmost plot, we increase both the interval length and the average 
number of observations in that interval. In both cases, our sampler again offers increased efficiency 
of up to two orders of magnitude. In fact, the only problems where we observed the sampler of 



Fearnhead and Sherlock (2006) to outperform ours were low-dimensional problems with only a 



few Poisson observations in a long interval, and with one very unstable state. A few very stable 
MJP states and a few very unstable ones results in a high uniformization rate f2 but only a few state 
transitions. The resulting large number of virtual jumps can make our sampler inefficient. 




dimension 



Figure 6: CPU time to produce 100 effective samples as the MJP dimension increases 



In figure [6j we plot the time to produce 100 effective samples as the number of states of the 
latent MJP increases. Here, we fixed the number of Poisson observations to 10 over an interval 
of length 10. We see that our sampler (plotted with squares) offers substantial speed-up over the 
sampler of Fearnhead and Sherlock (2006) (plotted with circles). We see that for both samplers 
computation time scales cubically with the latent dimension. However, recall that this cubic scaling 
is not a property of our MJP trajectory sampler; rather it is a consequence of using the equilibrium 
distribution of a sampled rate matrix as the initial distribution over states, which requires calculat- 
ing an eigenvector of a proposed rate matrix. If we fix the initial distribution over states (say to 
the discrete uniform distribution), giving the line plotted with inverted triangles in the figure, we 
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Figure 7: The predator-prey network (left) and the drug-effect CTBN (right) 



observe that our sampler scales quadratically. 



5 Continuous-time Bayesian Networks (CTBNs) 



Continuous-time Bayesian networks (CTBNs) are compact, multi-component representations of 
MJPs with structured rate matrices (Nodelman et al. 2002). Special instances of these models 
have long existed in the literature, particularly stochastic kinetic models like the Lotka-Volterra 
equations, which describe interacting populations of animal species, chemical reactants, gene reg- 
ulatory networks etc (Wilkinson 2009| ). There have also been a number of related developments, 
see for example [Bolch et aL ( 1998 1 or Didelez|([2008). For concreteness however, we shall focus on 



CTBNs, a formalism introduced in Nodelman et al. (2002) to harness the representational power 
of Bayesian networks to characterize structured MJPs. 

Just as the familiar Bayesian network uses a product of conditional probability tables to rep- 
resent a much larger probability table, so too a CTBN represents a structured rate matrix with 
smaller conditional rate matrices. An m-component CTBN represents the state of an MJP at time 
t with the states of m nodes S 1 (t), . . . , S m (t) in a directed (and possibly cyclic) graph Q. Figure 
[7] shows two CTBNs, the 'predator-prey network' and the 'drug-effect network'. The former is 
a CTBN governed by the Lotka-Volterra equations (see subsection |5.3.1[), while the latter is used 



to model the dependencies in events leading to and following a patient taking a drug ( |Nodelman 
20021 ). 



et al. 



Intuitively, each node of the CTBN acts as an MJP with an instantaneous rate matrix that 
depends on the current configuration of its parents (and not its children, although the presence of 
directed cycles means a child can be a parent as well). The trajectories of all nodes are piecewise 
constant, and when a node changes state, the event rates of all its children change. The graph Q and 
the set of rate matrices (one for each node and for each configuration of its parents) characterize the 
dynamics of the CTBN, the former describing the structure of the dependencies between various 
components, and the latter quantifying this. Completing the specification of the CTBN is an initial 
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t t + dt 

Figure 8: Expanded CTBN 



t + 2dt 



distribution 7r over the state of nodes, possibly specified via a Bayesian network. 

It is convenient to think of a CTBN as a compact representation of an expanded (and now 
acyclic) graph, consisting of the nodes of Q repeated infinitely along a continuum (viz. time). In 
this graph, arrows lead from a node at a time t to instances of its children at time t + dt. Figure [8] 
displays this for a section of the drug-effect CTBN. The rates associated with a particular node at 
time t + dt are determined by the configuration of its parents at time t. Figure[8]is the continuous- 
time limit of a class of discrete-time models called dynamic Bayesian networks (DBNs) (Murphy, 
2002). In a DBN, the state of a node at stage i + 1 is dependent upon the configuration of its 
parents at stage i. Just as MJPs are continuous-time limits of discrete-time Markov chains, CTBNs 
are also continuous-time limits of DBNs. 

It is possible to combine all local rate matrices of a CTBN into one global rate matrix (see 
Nodelman et al. ( 2002| )), resulting in a simple MJP whose state-space is the product state-space 
of all component nodes. Consequently, it possible, conceptually at least, to directly sample a 
trajectory over an interval [t start, tend] using Gillespie's algorithm. However, with an eye towards 
inference, algorithm [3] describes a generative process that exploits the structure in the graph Q. 
Like section [2j we represent the trajectory of the CTBN, S(t), with the initial state s and the 
pair of sequences (S, T). Let the CTBN have m nodes. Now, Si, the ith element of S, is an m- 
component vector representing the states of all nodes at £j, the time of the ith jump. We write 
this as Si — (sj, ■ ■ ■ , sj 1 ). The rate matrix of a node k varies over time as the configuration of its 
parents changes, and we will write A k,t for the relevant matrix at time t. Following equation ([5]), 
we can write down the probability density of (s , S, T) as 



p(s ,S,T) = 7To(s 




exp 



E 

k=l 



(22) 
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Algorithm 3 Algorithm to sample a CTBN trajectory on the interval [t start , t end 



Input: The CTBN graph Q, a set of rate matrices {A} for all nodes and for all 

parent configurations and an initial distribution over states 7r . 
Output: A CTBN trajectory S(t) = (s , S, T). 



1: Draw an initial configuration s = (sj, Sq, ...) ~ n . Set t = t start and i = 0. 

2: loop 

3: For each node k, draw z k ~ expd^^ |)- 

4: Let k i+ i = argmin fc z k be the first node to change state. 
5: If ti + z ki+1 > t end then return (s , . . . , Sj, h, . . . , U) and stop. 
6: Increment i and let ti = + z ki be the next jump time. 
7: Let s' = s^t 1 be the previous state of node k t . 
8: Set s ki = s with probability proportional to A^'f 1 ' 1 for each s ^ s'. 
9: Set s k = sf_i f or all k ^ h L . 

10: end loop 



5.1 Inference in CTBNs 



We now consider the problem of posterior inference over trajectories given some observations. 
Write the parents and children of a node k as V(k) and C(k) respectively. Let M.B(k) be the 
Markov blanket of node k, which consists of its parents, children, and the parents of its children. 
Given the entire trajectories of all nodes in A4B(k), node k is independent of all other nodes in the 



network (Nodelman et al. 2002) (see also equation ( [24] ) below). This suggests a Gibbs sampling 
scheme where the trajectory of each node is resampled given that of its Markov blanket. This 
approach was followed by El-Ha y~et al.| (|2008). 

However, even without any associated observations, sampling a node trajectory conditioned 
on the complete trajectory of its Markov blanket is not straightforward. To see this, rearrange the 



terms of equation ( 22 ) to give 



(>xp[-/ \A%t {t) \dt) (23) 



p(s , S, T) = 7c (s ) J] <P(S k , T k \s , S ^ k \T^) 

k=l 

<P(S k ,T k \s ,S^ k \T^)= (ft A tt 

where for any set of nodes B, (s^, S b ,T b ) represents the associated trajectories. Note that the 
0(-) terms are not conditional densities given parent trajectories, since the graph Q can be cyclic. 
We must also account for the trajectories of node A;'s children, so that the conditional density of 
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{s k ,S k ,T k ) is actually 



p(s k ,S k ,T k \s^ k ,S- k ,T- k ) ocn (s k o\so k mS\T k \s ,S^ k \T r ^) 

■ n 0(s c ,T c i so ,^ c \T^) 



(24) 



cec(k) 



Here -ik denotes all nodes other than k. Thus, even over an interval of time where the parent 
configuration remains constant, the conditional distribution of the trajectory of a node is not a 
homogeneous MJP because of the effect of the node's children, which act as 'observations' that 
are continuously observed. If A c,t is constant over t, the corresponding </>(■) is the density of an 
MJP given the initial state. Since A c,t varies in a piecewise-constant manner, the </>(•) term is 
actually the density of a piecewise-inhomogeneous MJP. Effectively, we have a 'MJP-modulated 
MJP', so that the inference problem here is a generalization of that for the MMPP of section |4} 
El-Hay et"aL] ( |20Q8j ) described a matrix-exponentiation-based algorithm to update the trajectory 



of node k. At a high-level their algorithm is similar to that of |Fearnhead and Sherlock| ( |2006| ) for 
MMPPs, with the Poisson observations of the MMPP generalized to transitions in the trajectories 
of child nodes. Consequently, it uses an expensive forward-backward algorithm involving matrix 



exponentiations. In addition, El-Hay et al. (2008) resort to discretizing time to obtain the transition 
times. 



5.2 Auxiliary Variable Gibbs sampling for CTBNs 

We now show how our uniformization-based sampler can easily be adapted to conditionally sam- 
ple a trajectory for node k without resorting to approximations. In the following, for notational 
simplicity, we will drop the superscript k whenever it is clear from context. Recall that the MJP 
trajectory (s , S, T) for node k has a uniformized construction from a subordinating Poisson pro- 
cess. The piecewise constant trajectories of the parents of k imply that the MJP is piecewise 
homogeneous, and we will use a piecewise constant rate Vt l which dominates the associated transi- 
tion rates, i.e. fl* > | -A*'* | for all s. This allows the dominating rate to 'adapt' to the local transition 
rates, and is more efficient when, e.g. the transition rates associated with different parent configu- 
rations are markedly different. Recall also that our algorithm first reconstructs the thinned Poisson 
events Uj- using a piecewise homogeneous Poisson process with rate (fl*+A g 'L), and then updates 
the trajectory using the forward-backward algorithm (so that W = T U Uj- forms the candidate 
transitions times of the MJP). 

In the present CTBN context, just as the subordinating Poisson process is inhomogeneous, so 
too the Markov chain used for the forward-backward algorithm will have different transition matri- 
ces at different times. In particular, the transition matrix at a time Wi (where W = (wi, . . . , w\w\)) 
is 

Ak,Wi 

B > = 1 + n=r < 25 > 

Finally, we need also to specify the likelihood function L$(s) accounting for the trajectories 
of the children in addition to actual observations in each time interval [wj, lUi+i). From equations 
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< |2~3| ) and ([24]), this is given by 




Li(s) = L?(s) | | | || A^ lexpf-/ (26) 

where Lf (s) is the likelihood coming from actual observations dependent on the state of node k in 
the time interval. Note that the likelihood above depends only on the number of transitions each of 
the children make as well as how much time they spend in each state, for each parent configuration. 

The new trajectory S k (t) is now obtained using the forward-backward algorithm, with the 
given inhomogeneous transition matrices and likelihood functions. The following proposition now 
follows directly from our previous results in section [3j 

Proposition 4. The auxiliary variable Gibbs sampler described above converges to the posterior 
distribution over the CTBN sample paths. 

Note that our algorithm produces a new trajectory that is dependent, through T, on the previous 
trajectory (unlike a true Gibbs update as in El-Hay et al. ( |2008[ ) where they are independent). 



However we find that since the update is part of an overall Gibbs cycle over nodes of the CTBN, 
the mixing rate is actually dominated by dependence across nodes. Thus a true Gibbs update has 
negligible benefit towards mixing, while being more computationally costly. 

5.3 Experiments 

In the following, we evaluate a C++ implementation of our algorithm on a number of CTBNs. As 
before, the dominating rate fi* was set to max s 2|A^*|. 

5.3.1 The Lotka-Volterra process 

We first apply our sampler to the Lotka-Volterra process ( [Wilkinson] |2009| Opper and SanguinettiJ 



2007). Commonly referred to as the predator-prey model, this describes the evolution of two 
interacting populations of 'prey' and 'predator' species. The two species form the two nodes of 
a cyclic CTBN (figure [7] (left)), whose states x and y represent the sizes of the prey and predator 
populations. The process rates are given by 

A ({x, y} -> {x + 1, y}) = ax A ({x, y} ->• {x - 1, y}) = /3xy 

A ({x, y} ->■ {x, y + 1}) = 5xy A ({x, y} ->■ {x, y - 1}) = 73/ 

where we set the parameters as follows: a = 5 x 10~ 4 , (3 = 1 x 10~ 4 , 7 = 5 x 10 -4 , 5 = 1 x 10~ 4 . 
All other rates are 0. This defines two infinite sets of infinite-dimensional conditional rate matrices. 



In its present form, our sampler cannot handle this infinite state-space. Like Opper and Sanguinetti 



(2007), we limit the maximum number of individuals of each species to 200, leaving us with 400 
rate matrices of size 200 x 200. Note that these matrices are tridiagonal and very sparse: at any 
time the size of each population can change by at most one. Consequently, the complexity of 
our algorithm scales linearly with the number of states (we did not modify our code to exploit 
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Figure 9: Posterior (mean and 90% confidence intervals) over (left) prey and (right) predator paths 
(observations (circles) were available only until 1500). 

this structure, though this is straightforward). A 'true' path of predator-prey population sizes was 
sampled from this process, and its state at time t = was observed noiselessly. Additionally 15 
noisy observations were generated, spaced uniformly at intervals of 100. The noise process was: 



p(X(t)\S(t)) oc 



2|x(t)-s(t)| + io-6 



(27) 



Given these observations (as well as the true parameter values), we approximated the posterior 
distribution over paths by two methods: using 1000 samples from our MCMC sampler (with a 



burn-in period of 100) and using the mean-field (MF) approximation of Opper and Sanguinetti 
(2007 3 Figure w\ shows the true paths (in black), the observations (as circles) as well as the 



posterior means and 90% confidence intervals produced by the two algorithms for the prey (left) 
and predator (right) populations. As can be seen, both algorithms do well over the first half of the 
interval where data is present. In the second half, the MF algorithm appears to underestimate the 
predicted size of the predator population. On the other hand, the MCMC posterior reflects the true 
trajectory better. In general, we found the MF algorithm to underestimate the posterior variance in 
the MJP trajectories, especially over regions with few observations. 



5.4 Average relative error vs number samples 



For the remaining experiments, we compared our sampler with the Gibbs sampler of El-Hay et al. 
( |2008[ ). For this comparison, we used the CTBN-RLE package of |Shelton et ~aL\ ( |2010| ) (also 
implemented in C++). In all our experiments, as with the MMPP, we found our algorithm to be 
significantly faster, especially for larger problems. To prevent details of the two implementations 
from clouding the picture and to reiterate the benefit afforded by avoiding matrix exponentiations, 
we also measured the amount of time CTBN-RLE spent exponentiating matrices. This constituted 



3 We thank Guido Sanguinetti for providing us with his code. 
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Figure 10: Average relative error vs number of samples for 1000 independent runs; burn-in = 
200. Note that in this scenario Uniformization was about 12 times faster, so that for the same 
computational effort it produces significantly lower errors. 



between 10% to 70% of the total running time of their algorithm. In the plots we refer to this as 
'El Hay et al. (Matrix Exp.)'. We found that our algorithm took less time than even this. 

In our next experiment, we followed El-Hay et al. (2008) in studying how average relative 
error varies with the number of samples from the Markov chain. Average relative error is defined 

by Ylj ^g 9 ^ , and measures the total normalized difference between empirical (9j) and true (9j) 
averages of sufficient statistics of the posterior. The statistics in question are the time spent by each 
node in different states as well as the number of transitions from each state to the others. The exact 
values were calculated by numerical integration when possible, otherwise from a very long run of 
CTBN-RLE. 



As in El-Hay et al. (2008 1, we consider a CTBN with the topology of a chain, consisting of 5 
nodes, each with 5 states. The states of the nodes were observed at times and 20 and we produced 
endpoint-conditioned posterior samples of paths over the time interval [0, 20]. We calculate the 
average relative error as a function of the number of samples, with a burn-in of 200 samples. Figure 



10 shows the results from running 1000 independent chains for both samplers. Not surprisingly, the 



sampler of El-H ay et al.| ( |2008| ), which produces conditionally independent samples, has slightly 
lower errors. However the difference in relative errors is minor, and is negligible when considering 
the dramatic (sometimes up to two orders of magnitude; see below) speed improvements of our 
algorithm. For instance, to produce the 10000 samples, the El-Hay et al.| ( |2008 ) sampler took about 
6 minutes, while our sampler ran in about 30 seconds. 



5.5 Time requirements 

In the next three experiments, we compare the times required by CTBN-RLE and our uniformization- 
based sampler to produce 100 effective samples for CTBNs of different configurations. These times 
were estimated from runs of 10000 samples after a burn-in period of 1000 iterations. Since CTBN- 
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Figure 1 1 : CPU time vs (left) length of CTBN chain (centre) number of states of CTBN nodes 
(right) time interval of CTBN paths. 



RLE does not support Bayesian inference for CTBN parameters, we kept these fixed and produced 
ESS estimates from the number of transitions of each node and the amount of time spent in each 
state. The overall ESS is again the median ESSs (see section 4. 1 for details). Each MCMC run pro- 
duced samples from an endpoint-conditioned CTBN with random gamma distributed parameters 
and each point in the figures is an average over 10 simulations. 

In the first of these experiments, we measured the times to produce these samples for the chain- 
shaped CTBN described above, as the number of nodes in the chain increases. The leftmost plot 
in figure 1 1 shows the results. As might be expected, the time required by our algorithm grows 
linearly with the number of nodes. For |El-Hay et al. ( |2008 ), the cost of the algorithm grows faster 
than linear, and quickly becoming unmanageable. The time spent calculating matrix exponentials 
does grow linearly, however our uniformization-based sampler always takes less time than even 
this. 

Next, we kept the length of the chain fixed at 5, instead increasing the number of states per 
node. As seen in the middle plot, once again, our sampler is always faster. Asymptotically, we 
expect our sampler to scale as 0(N 2 ) and El-Hay et al. (2008 1 as 0(N 3 ). While we have not hit 
that regime yet, we can see that the cost of our sampler grows more slowly with the number of 
states. 

Our final experiment, reported in the right plot, measures the time required as the interval 
length t en d — tstart increases. For this experiment, we used the drug-effect network shown in figure 
[7J where the parameters were set to standard values (obtained from CTBN-RLE) and the state of 
the network was fully observed at the beginning and end times. Again, our algorithm is the faster 
of the two, showing a linear increases in computational costs with the length of the interval. It 
is worth pointing out here that the algorithm of El- Hay et al.| (2008) has a 'precision' parameter, 
and that by reducing the desired temporal precision, faster performance can be obtained. However, 
since our sampler produces exact samples (up to numerical precision), our comparison is fair. In 
the above experiments, we left this parameter at its default value. 
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6 Discussion 



We proposed a novel Markov chain Monte Carlo sampling method for Markov jump processes. 
Our method exploits the simplification of the structure of the MJP resulting from the introduction 
of auxiliary variables via the idea of uniformization. This constructs a Markov jump process by 
subordinating a Markov chain to a Poisson process. This can be viewed as running a Markov chain 
on a random discretization of time. Our sampler is a blocked Gibbs sampler in this augmented 
representation and proceeds by alternately resampling the discretization given the Markov chain 
and vice versa. Experimentally, we find that this auxiliary variable Gibbs sampler is computation- 
ally very efficient. The sampler easily generalizes to other MJP-based models, and we presented 
samplers for Markov-modulated Poisson processes and continuous-time Bayesian networks. In 
our experiments, we showed significant speed-up compared to state-of-the-art samplers for both. 

Our method opens a number of avenues worth exploring. One concerns the subordinating 
Poisson rate which acts as a free-parameter of the sampler. While our heuristic of setting this to 
max s 2\A S \ worked well in our experiments, this may not be the case for rate matrices with widely 
varying transition rates. A possible approach is to 'learn' a good setting of this parameter via 
adaptive MCMC methods. More fundamentally, it would be interesting to investigate if theoretical 
claims can be made about the 'best' setting of this parameter under some measures of mixing speed 
and computational cost. 

Next, there are a number of immediate generalizations of our sampler. First, our algorithm 
is easily applicable to inhomogeneous Markov jump processes where techniques based on matrix 
exponentiation cannot be applied. Following recent work ( Rao and Teh[|20lTb ), we can also look 
at generalizing our sampler to semi-Markov processes where the holding times of the states follow 
non-exponential distributions. These models find applications in fields like biostatistics, neuro- 
science and queuing theory ( |Mode and Pickens , 1988). By combining our technique with slice 
sampling ideas ( Neal[ 2003[ ), we can explore Markov jump processes with countably infinite state 
spaces. Another generalization concerns MJPs with unbounded rate matrices. For the predator- 
prey model, we avoided this problem by bounding the maximum population sizes; otherwise it is 
impossible to choose a dominating fi. Of course, we know that any trajectory from this process 
is bounded with probability 1, consequently it is possible to extend our method to this case by 
treating f2 as a trajectory dependent random variable. 
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